Transportation is an important part of our lives. Our research focused on taxi trips in New York City (NYC). Compared with other means of transportation, the most noticeable advantage of taxi service is that it combines a high degree of mobility with schedule flexibility. This is especially essential in NYC, where most people use some form of taxi service every day. For our project, we used four datasets to develop our research about taxi services in NYC. Three of the four datasets came from Kaggle. The first dataset gave us NYC weather data from 2016; for example, it contains ‘average temperature’, ‘maximum temperature’, and ‘minimum temperature’. The second dataset provided information about NYC taxi trips in 2016, with columns such as ‘pickup_datetime’, ‘dropoff_datetime’, and ‘pickup_latitude’. The third shows NYC restaurants’ inspection records, including zip codes. The fourth dataset came from GitHub and provides zip code, latitude, and longitude for the USA.
1. How do different factors (weather, distance, and weekend or not) affect the trip duration?
2. Do the peak hours for weekdays and weekends distribute differently?
3. Which areas have denser pick-up spots for taxi drivers? Why?
For every dataset, we converted data types, removed outliers to prepare all data for the join process, and used ‘groupby’ to calculate the new variables we needed. For the weather dataset, we first dropped all columns except ‘date’ and ‘average temperature’. For the taxi trips dataset, we removed all extra spaces, converted the date column to datetime format, and generated three columns representing pickup hour, date, and day. Then, we joined these two dataframes. After the join, we created a dummy variable for weekends versus weekdays and defined a distance function based on latitude and longitude. We then changed the type of the longitude- and latitude-related columns to float and dropped three useless columns. For the USA zip code dataset, we renamed the columns to ‘ZIPCODE’, ‘LAT’, and ‘LONG’ for the join process. For the restaurants dataset, we deleted duplicate records for the same restaurants and dropped rows with missing values. We used ‘groupby’ to count the number of restaurants by zip code and then joined the third and fourth datasets.
# Import the modules
import glob
import pandas as pd
import numpy as np
from math import sin, cos, sqrt, atan2, radians
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import plotly.graph_objects as go
import datetime as dt
# Check the current path
%pwd
# Change the path
%cd C:/Users/yang7/OneDrive/Desktop/python
# weather dataset import
weather = pd.read_csv (r'C:/Users/yang7/OneDrive/Desktop/python/weather_data_nyc_centralpark_2016.csv')
weather.drop(["maximum temerature", "minimum temperature","precipitation","snow fall","snow depth"], axis = 1, inplace = True)
weather.head()
# 2nd dataset of taxi drives in NYC from 2016 Jan to 2016 June
sample = pd.read_csv (r'C:/Users/yang7/OneDrive/Desktop/python/NYC Taxi Data Set.csv')
sample.head()
sample.shape
# Delete any leading spaces in the pickup_datetime column
sample.pickup_datetime = sample.pickup_datetime.str.lstrip()
# Extract a part of date as a new list for future use
# Convert all the date strings in the pickup time column to one uniform format
import datetime
taxiDate = []
# Note: every branch reads the date from the pickup_datetime column (index 2)
for i in range(len(sample)):
    if sample.iat[i,2][1]=='/' and sample.iat[i,2][3]=='/':
        taxiDate.append(datetime.datetime.strptime(sample.iat[i,2][0:8], '%m/%d/%Y').strftime('%d-%m-%y'))
    elif sample.iat[i,2][2]=='/' and sample.iat[i,2][4]=='/':
        taxiDate.append(datetime.datetime.strptime(sample.iat[i,2][0:9], '%m/%d/%Y').strftime('%d-%m-%y'))
    elif sample.iat[i,2][2]=='/' and sample.iat[i,2][5]=='/':
        taxiDate.append(datetime.datetime.strptime(sample.iat[i,2][0:10], '%m/%d/%Y').strftime('%d-%m-%y'))
    else:
        taxiDate.append(datetime.datetime.strptime(sample.iat[i,2][0:10], '%d-%m-%Y').strftime('%d-%m-%y'))
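The branching above normalizes the mixed date formats by hand. As a sketch of an alternative, `pd.to_datetime` with explicit formats can do the same normalization, assuming (as the branches do) that only the ‘M/D/YYYY’ and ‘DD-MM-YYYY’ shapes occur in the raw file:

```python
import pandas as pd

def parse_mixed_date(s):
    # Try each format seen in the raw pickup strings.
    # Assumption: only 'M/D/YYYY' (any digit widths) and 'DD-MM-YYYY' occur.
    for fmt in ("%m/%d/%Y", "%d-%m-%Y"):
        try:
            return pd.to_datetime(s, format=fmt)
        except ValueError:
            continue
    return pd.NaT

# Toy examples standing in for the raw pickup_datetime strings
dates = pd.to_datetime(pd.Series(["1/5/2016", "31-12-2016"]).map(parse_mixed_date))
normalized = dates.dt.strftime("%d-%m-%y")  # same '%d-%m-%y' strings as taxiDate
```

Because each format is tried in order and failures raise `ValueError`, this avoids indexing into individual characters of each string.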
# Generate three columns: pickup hour, date, and day of the week
sample=sample.assign(pickup = ' ')
sample=sample.assign(date = ' ')
sample=sample.assign(day = ' ')
for i in range(len(sample)):
    sample.iat[i, -3] = sample.iat[i,2][-5:-3]  # pickup hour from pickup_datetime
    sample.iat[i, -2] = taxiDate[i]
    sample.iat[i, -1] = datetime.datetime.strptime(sample.iat[i,-2], '%d-%m-%y').strftime('%A')
sample.head()
# Join two tables
data=pd.merge(weather, sample, on='date', how='right')
# Create a variable for weekend/weekdays
data=data.assign(weekend = 'weekdays')
for i in range(len(data)):
    if data.iat[i, -2] == 'Saturday' or data.iat[i, -2] == 'Sunday':
        data.iat[i, -1] = 'weekends'
# Define the distance function by latitudes and longitudes
def distance(lat1, lon1, lat2, lon2):
    # Approximate radius of earth in km
    R = 6373.0
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c
# Create a new column and convert the lons and lats to floats
data=data.assign(distance_km = 0)
data[["pickup_longitude", "pickup_latitude","dropoff_longitude","dropoff_latitude"]] = data[["pickup_longitude", "pickup_latitude","dropoff_longitude","dropoff_latitude"]].astype(float)
data.dtypes
# Transform to distance
for i in range(len(data)):
    data.iat[i, -1] = distance(data.iat[i, 8], data.iat[i, 7], data.iat[i, 10], data.iat[i, 9])
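The row-by-row loop above works but is slow on a million rows. As a sketch (not the notebook's original code), the same haversine formula can be vectorized with NumPy so it runs on whole columns at once; the column names in the usage comment are assumed to match the taxi dataset:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # Vectorized haversine distance; inputs may be scalars or NumPy arrays.
    R = 6373.0  # approximate radius of earth in km, matching distance() above
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return R * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Usage sketch (column names assumed):
# data['distance_km'] = haversine_km(data['pickup_latitude'],
#                                    data['pickup_longitude'],
#                                    data['dropoff_latitude'],
#                                    data['dropoff_longitude'])
```

Passing the four Series directly lets NumPy broadcast the arithmetic, replacing the per-row `iat` indexing.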
# Drop the useless cols.
data.drop(["store_and_fwd_flag", "vendor_id","id","dropoff_datetime"], axis = 1, inplace = True)
data = data.dropna()
data.head()
# Import the 3rd dataset that has zipcode and locations
zipLocation = pd.read_csv('https://gist.githubusercontent.com/erichurst/7882666/raw/5bdc46db47d9515269ab12ed6fb2850377fd869e/US%2520Zip%2520Codes%2520from%25202013%2520Government%2520Data')
# Change the column names for the join process
zipLocation.columns = ['ZIPCODE', 'LAT','LONG']
# Import the 4th dataset that has NYC restaurants' inspection record
# We will only use the zipcode to locate and calculate the sum of restaurants in that area
restaurants = pd.read_csv (r'C:/Users/yang7/OneDrive/Desktop/python/New_York_City_Restaurant_Inspection_Results.csv')
restaurants.head()
# Delete the extra records of the same restaurants
restaurants = restaurants.drop_duplicates(subset=['CAMIS'], keep='first')
restaurants = restaurants[['CAMIS','DBA','ZIPCODE']].copy()
# Drop the rows with missing value
restaurants = restaurants.dropna()
restaurants.head()
# Calculate the num of restaurants by zip code
restaurantsZip=restaurants.groupby(by='ZIPCODE')['DBA'].describe()
# Prepare for join
restaurantsZip=restaurantsZip.reset_index()
# Join
restaurantsLocation=pd.merge(restaurantsZip, zipLocation, on='ZIPCODE', how='left')
# Filter out zip codes with too few restaurants
restaurantsLocation=restaurantsLocation[restaurantsLocation['count']>5]
# Convert the count column data type.
restaurantsLocation['count'] = restaurantsLocation[['count']].astype(float)
Since our goal is to understand the relationship between trip duration and weather, distance, weekend/weekday, and passenger number, we fit a linear regression model on log(trip duration) and chose ‘average temperature’, ‘distance_km’, ‘weekend’, and ‘passenger_count’ as the independent variables. Below are the OLS regression results and the residual plot.
# Make a subset copy to use in this question
regression=data[['average temperature', 'distance_km', 'weekend','trip_duration','passenger_count']].copy()
regression.head()
# Determine the independent variables
cols = ['average temperature', 'distance_km', 'weekend','passenger_count']
cat_cols = ['weekend']
# Set up variables for multiple linear regression model
X = pd.get_dummies(regression[cols], columns=cat_cols, prefix='', prefix_sep='', drop_first=True)
X = sm.add_constant(X)
y = np.log(regression['trip_duration'])
pd.concat([X, y], axis=1).head()
# Fit linear regression model and report results
model = sm.OLS(endog=y, exog=X)
results = model.fit()
results.summary()
# Plot the residual information
plt.figure(figsize=(8,8))
sns.set(style="whitegrid")
ax = sns.scatterplot(x=results.fittedvalues,y=results.resid, marker='o', data=regression)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals Plot')
As above, we tried a log-level regression model. First, the R-squared is very small at 0.32, which means that only 32% of the variation can be explained by our model. Second, although the p-values of all variables are extremely close to zero, meaning the coefficients are significant, the coefficients of passenger count and average temperature are small. Third, the residual plot shows many outliers in the dataset. It is supposed to show a cloud of random points around 0; however, we can see over 10 points with extreme error values. Therefore, we need to adjust the included variables, remove outliers, and rerun the model.
# Make a copy
noOutlier = data.copy()
# Define a function to remove the outliers
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
# Apply to the variables
noOutlier=remove_outlier(noOutlier,'average temperature')
noOutlier=remove_outlier(noOutlier,'passenger_count')
noOutlier=remove_outlier(noOutlier,'trip_duration')
noOutlier=remove_outlier(noOutlier,'distance_km')
noOutlier.shape
cols = ['distance_km', 'weekend','passenger_count']
cat_cols = ['weekend']
# Set up variables for multiple linear regression model
X = pd.get_dummies(noOutlier[cols], columns=cat_cols, prefix='', prefix_sep='', drop_first=True)
X = sm.add_constant(X)
y = np.log(noOutlier['trip_duration'])
# Fit linear regression model and report results
model = sm.OLS(endog=y, exog=X)
results2 = model.fit()
results2.summary()
When the distance increases by 1 km, the trip duration increases by about 32.89%. When the passenger number increases by 1, the trip duration increases by about 3.12%. On weekends, with other variables held constant, the trip duration decreases by about 13.38% compared to weekdays.
Although the R-squared is only 40.7%, we think the included independent variables do have an impact on trip duration, especially the direct distance between pick-up and drop-off spots, whether it is a weekend, and the passenger number. We would need more information about other potential factors to improve the model.
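The percentage readings above use the standard linear approximation for a log-level model, 100·β; the exact effect of a one-unit change is 100·(e^β − 1), which is noticeably larger for bigger coefficients. A quick check with the coefficients quoted above:

```python
import math

def pct_effect(beta):
    # Exact percentage change of the dependent variable for a one-unit change
    # in a regressor, in a log-level model: 100 * (e^beta - 1).
    return 100 * (math.exp(beta) - 1)

# Coefficients as quoted in the text (approximate values from results2):
approx_distance = 100 * 0.3289        # linear approximation for distance_km
exact_distance = pct_effect(0.3289)   # exact effect, a few points larger
exact_passenger = pct_effect(0.0312)  # for small coefficients the two agree closely
```

For ‘passenger_count’ the approximation and the exact value are nearly identical, so the small-coefficient readings are safe either way.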
# Plot the residual
plt.figure(figsize=(8,8))
sns.set(style="whitegrid")
ax = sns.scatterplot(x=results2.fittedvalues,y=results2.resid, marker='o', data=noOutlier)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals Plot')
Here are the new OLS regression results and residual plot (we dropped ‘average temperature’, since its coefficient is small, and removed outliers). After re-running the same model, we achieved a much better residual plot, roughly within the range of -6 to 2. We can tell that as the expected trip duration increases, we get a more accurate result. When the expected trip duration is under 6, there is a bigger chance that the actual duration is shorter. The statistical results show that all p-values are small enough to confirm that the coefficients in the model are significantly different from 0.
results2.params
#Confidence intervals ([0.025, 0.975])
results2.conf_int()
# Calculate the error bar lengths for the confidence intervals.
err_series = results2.params - results2.conf_int()[0]
err_series
coef_df = pd.DataFrame({'coef': results2.params.values[1:],
'err': err_series.values[1:],
'varname': err_series.index.values[1:]
})
coef_df
# Plot the coeffients with their error bar
fig, ax = plt.subplots(figsize=(8, 8))
sns.set(style="white")
coef_df.plot(x='varname', y='coef', kind='bar',
ax=ax, color='none',
yerr='err', legend=False)
ax = sns.scatterplot(x=np.arange(coef_df.shape[0]), y=coef_df['coef'], marker='o', data=coef_df)
ax.set_ylabel('Coefficient Value')
ax.set_xlabel('Independent Variables')
ax.set_title('Coefficients and Errors')
ax.axhline(y=0, linestyle='--', color='b', linewidth=2)
ax.xaxis.set_ticks_position('none')
ax.set_xticklabels(['distance_km', 'passenger_count', 'weekends'],
rotation=0, fontsize=16)
Then, we calculated the confidence intervals and the corresponding error bars to check whether the new regression is accurate. Based on the errors and coefficients of ‘distance_km’, ‘weekend’, and ‘passenger_count’, we created a plot. We can see that the errors are very small, which means the regression coefficients are measured precisely.
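The error-bar length computed above (`params` minus the lower 2.5% bound) is simply the half-width of the 95% confidence interval, roughly 1.96 standard errors. A minimal sketch with a hypothetical coefficient and standard error (illustrative numbers, not taken from `results2`):

```python
# Hypothetical values for illustration only.
coef = 0.3289  # estimated coefficient
se = 0.0005    # its standard error
z = 1.96       # ~97.5th percentile of the standard normal

lower, upper = coef - z * se, coef + z * se
half_width = coef - lower  # equals z * se, the length of the error bar
```

A small standard error gives a narrow interval, which is why the plotted error bars are barely visible.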
First, we sorted the values by ‘pickup’. If ‘day’ is ‘Saturday’ or ‘Sunday’, we counted it as ‘weekends’. For each of the 7 days of the week, we used ‘groupby’ to count passengers per pickup hour and then made one plot. In the plot, the x-axis shows the pick-up time and the y-axis shows the total passenger number.
noOutlier.shape
noOutlier.head()
# Convert the pickup column into a datetime
for i in range(len(noOutlier)):
    noOutlier.iat[i, -4] = dt.datetime.strptime(noOutlier.iat[i, -4], '%H')
# Sort by the pickup hour
peak=noOutlier.sort_values('pickup')
# Make a dataframe of each day (Monday/Tuesday/...) and hour with their passenger count information
peak.pickup=peak.pickup.dt.strftime("%H:00")
allPeak=peak.groupby(by=['pickup','day'])['passenger_count'].describe()
allPeak=allPeak.reset_index()
allPeak=allPeak.assign(weekend = 'weekdays')
for i in range(len(allPeak)):
    if allPeak.iat[i, 1] == 'Saturday' or allPeak.iat[i, 1] == 'Sunday':
        allPeak.iat[i, -1] = 'weekends'
plt.figure(figsize = (16,9))
sns.set(style="darkgrid")
palette = sns.color_palette("mako_r", 7)
# Plot the overall peak hours changes in a graph
# Use different ways to show weekdays or weekends
# Use different color to show different days
overall=sns.lineplot(x='pickup', y="count",
hue='day',style='weekend',
palette = palette,markers = ["o", "s"],
data=allPeak)
for item in overall.get_xticklabels():
    item.set_rotation(60)
plt.title("Taxi Peak Hours Trend in NYC", fontsize = 15)
plt.xlabel("Pick-up Time", fontsize = 15)
plt.ylabel("The total number of passengers", fontsize = 15)
plt.show()
From the plot, we can tell that from Monday to Thursday the numbers of passengers follow the same pattern/trend, which is different from the weekend pattern. Peak hours for Friday, Saturday, and Sunday tend to be later. On Friday nights around 11 pm, the number of passengers stays at the same level instead of decreasing as on weekdays. The demand for taxis at midnight rises starting Thursday night and then gets much bigger on weekends. No matter which day, the lowest demand is around 4 am and the highest demand is around 7 pm. However, on weekdays the second biggest peak is around 8 am, which is a reasonable time for going to work. On weekends, there are two other small peaks: one around 1 am, when people hang out, and the other around 1 pm, probably reflecting lunch trips. Saturdays and Sundays have very similar trends overall except at night; both have higher demand from 12 am to 4 am than weekdays.
We decided to use Mapbox to create a map for this question. We first generated a column of random values and sorted the table by those random numbers. Using a for loop, we added a dummy variable called ‘sample’ that tags every 20th row. This random sampling reduced the huge dataset to 5% of its original size; after sampling, the sample size became 79841. When loading the data with Mapbox, we can see an NYC map that shows pick-up spots and restaurant density by zip code.
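The shuffle-and-tag procedure described above implements the sampling by hand. As a sketch of an alternative, pandas' `DataFrame.sample` draws the same fraction in one line; the toy frame below stands in for the real data, and `frac`/`random_state` are the relevant parameters:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the cleaned taxi data; only the sampling matters here.
df = pd.DataFrame({"trip_duration": np.arange(1000)})

# Keep a reproducible 5% random sample, the same fraction as tagging
# every 20th row after a random shuffle.
sampled = df.sample(frac=0.05, random_state=30)
```

`random_state` plays the role of `np.random.seed(30)` in the manual version, making the sample reproducible.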
# Your mapbox token
mapbox_access_token = 'pk.eyJ1Ijoid2FueXVuLXlhbmciLCJhIjoiY2syb3E4cTU5MTZhbDNtbzNyejRxZDAzbSJ9.V9aZq1zuZ7bovxHrjfce6g'
# Add a dummy variable to the DataFrame called "sample" that tags every 20th row to be included in the sample
# 5% of total data size
mapSample=noOutlier.assign(sample = 0)
mapSample.shape
# Assign random numbers
np.random.seed( 30 )
mapSample['random number'] = np.random.randint(0,1000,size=(len(mapSample),1))
mapSample=mapSample.sort_values('random number')
# Random sampling
for i in range(0, len(mapSample)):
    if i % 20 == 0:
        mapSample.iat[i, -2] = 1
mapSample
mapSample=mapSample[mapSample['sample']==1]
mapSample['pickup_longitude'].describe()
# The mean longitude of the sample is -73.979750, which will be used as the zoom center later.
mapSample['pickup_latitude'].describe()
# the mean latitude of the sample is 40.753085, which will be used as zoom center later.
mapSample.shape
taxi_map_data = go.Scattermapbox(
lon = mapSample['pickup_longitude'],
lat = mapSample['pickup_latitude'],
mode = 'markers',
marker = dict(
color = 'lightskyblue',
symbol = 'circle',
opacity = .5
),
name = "Taxi pickup locations"
)
taxi_map_data2 = go.Scattermapbox(
lon = restaurantsLocation['LONG'],
lat = restaurantsLocation['LAT'],
mode = 'markers',
text = restaurantsLocation['count'],
hoverinfo='text',
marker = dict(
color = 'pink',
size = restaurantsLocation['count']/30,
symbol = 'circle',
opacity = .9,
),
name = "Restaurant density by zipcode "
)
taxi_map_layout = go.Layout(
title = 'Taxi Pickup Locations & Restaurants Density in NYC (Size: The number of restaurants)',
mapbox=go.layout.Mapbox(
accesstoken=mapbox_access_token,
zoom=1
)
)
taxi_map = go.Figure(data=[taxi_map_data,taxi_map_data2], layout=taxi_map_layout)
taxi_map.update_layout(
hovermode='closest',
mapbox=go.layout.Mapbox(
accesstoken=mapbox_access_token,
bearing=0,
center=go.layout.mapbox.Center(
lat=40.753085,
lon= -73.979750
),
pitch=0,
zoom=10
)
)
taxi_map.show()
Manhattan has the most popular pick-up spots, and the pink points in Manhattan are the densest. The Brooklyn area also has many spots along the river, and some points are around LGA airport. From the map, we can tell that the southern part of Manhattan has more restaurants, which is a likely factor shaping the pattern of taxi pick-up locations. Pick-up locations outside Manhattan also tend to fall within areas that have more restaurants. We can conclude that areas with more restaurants are more popular places to use taxi service.
# Zoom in on Manhattan to look at the pattern more closely
# The center is Central Park in NYC
taxi_map2 = go.Figure(data=[taxi_map_data,taxi_map_data2], layout=taxi_map_layout)
taxi_map2.update_layout(
hovermode='closest',
mapbox=go.layout.Mapbox(
accesstoken=mapbox_access_token,
bearing=0,
center=go.layout.mapbox.Center(
lat=40.778424,
lon=-73.96175
),
pitch=0,
zoom=12
)
)
taxi_map2.show()
Then, we zoomed in to see the pattern more closely. Within Manhattan, the pick-up spots in the southern part are generally denser than in the northern part. The most popular pick-up spots are located in the several blocks south of Central Park.
In conclusion, our analysis shows that ‘distance_km’ and ‘passenger_count’ have a positive impact on trip duration, while ‘weekend’ has a negative impact. It also shows that the peak hours from Monday to Thursday distribute similarly and differ a little from Friday's, while peak hours for weekdays and weekends distribute differently. Manhattan is truly a popular pick-up area, and within it, the southern part tends to be more popular than other parts. Our findings can help make the pick-up process more efficient for taxi drivers: they can arrange their schedules based on the regression model and peak-hour analysis to earn better profits, and they can pay more attention to areas with higher demand, so that customers' waiting time is minimized and drivers can make more trips as well.